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ABSTRACT 

A common problem in physics is to fit regression data by a parametric class of func- 
tions, and to decide whether a certain functional form allows for a good fit of the data. 
Common goodness of fit methods are based on the calculation of the distribution of 
certain statistical quantities under the assumption that the model under consideration 
holds true. This proceeding bears methodological flaws, e.g. a good "fit" - albeit the 
model is wrong - might be due to over-fitting, or to the fact that the chosen statisti- 
cal criterion is not powerful enough against the present particular deviation between 
model and true regression function. This causes particular difficulties when models 
with different numbers of parameters are to be compared. Therefore the number of 
parameters is often penalised additionally. We provide a methodology which circum- 
vents these problems to some extent. It is based on the consideration of the error 
distribution of the goodness of fit criterion under a broad range of possible models 
- and not only under the assumption that a given model holds true. We present a 
graphical method to decide for the most evident model from a range of parametric 
models of the data. The method allows to quantify statistical evidence for the model 
(up to some distance between model and true regression function) and not only ab- 
sence of evidence against, as common goodness of fit methods do. Finally we apply 
our method to the problem of recovering the luminosity density of the Milky Way 
from a de-reddened COBE/DIRBE L-band map. We present statistical evidence for 
flaring of the stellar disc inside the solar circle. 

Key words: methods: data analysis - methods: statistical - Galaxy: disc - Galaxy: 
structure. 



1 INTRODUCTION 

Often one is confronted with the problem to reconstruct 
an unknown function /(ti) from noisy observations yi = 
y(ti),i — 1,...,N. Astrophysical examples include rever- 
beration mapping of gas in active galactic nuclei and recov- 
ery of the spatial (three-dimensional) luminosity density of 
a galaxy from blurred observations of its surface brightness. 
See e.g. Lucy (1994) for more examples of astronomical in- 
verse problems. In this paper we are concerned with a new 
method to compare several competing parametric models 
for the regression function /. 

Due to the noisy measurements it is tempting to assume 
that yi = f(ti)+Ei, where the £j denote some random noise 
and f(ti) the expected value of yi, i.e. E[yi] = f(ti). In 
particular we allow for different error distributions of the Sj, 



which entails inhomogeneous variance patterns, viz. V[si] = 
of, as will be the case in our example of de-projecting the 
de-reddened COBE/DIRBE L-band surface brightness map 
of Spergel et al. (1996), as discussed by Bissantz & Munk 
(2001, [BM1]). 

It is a common proceeding to fit a class of functions U = 
{/,? : $ G O} (parametric model) to the data yi. The 
parametric model may depend on a parameter where 
$ G O C M d . A popular method to select a "best-fitting" 
t? from O is to minimise the empirical mean squared error 
(MSE) 

N 

Q%(&):=Y,( Vi -U(ti)) 2 (1) 

i=l 

or weighted variants of it. This gives the least squares 
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estimator (LSE) of i9. Other measures of goodness of fit are 
e.g. L 1 -error criteria, where the absolute deviation between 
yi and f$(ti) is considered (Seber & Wild 1989). A small 
value of Qjv^) often is used as an indication for a good 
explanation of the observations by the model /§. Note that 
in regression models where the noise is inhomogeneous the 
quantity Q 2 N ($) is often useless (cf. [BM1] for an explana- 
tion) and more subtle methods have to be applied. 

An advantage of the parametric fitting methodology in con- 
trast to nonparametric curve estimation, i.e. approximating 
the data by arbitrary functions (e.g. by splines, orthogonal 
series or wavelets, cf. Efromovich, 1999 or Hart, 1997) is the 
fact that often physical reasoning resulting from a theory 
suggests such a class of functions U. Furthermore, subse- 
quent data analysis and interpretation becomes very simple 
if once a proper f$ is selected. Hence it is an important task 
to pre-specify U correctly in order to obtain a reasonable fit. 
Therefore in this paper we discuss the problem of evaluat- 
ing the goodness of fit of a parametric model U . Moreover, 
we offer a graphical method which allows to select a proper 
model U from a class of different models U = {Uk}k=i,...,h 
say. 

A common proceeding is to assume that the model holds, 
and to test if the observed data give reason to reject the 
model. This type of goodness of fit tests is performed by eval- 
uating the probability distribution of a pre-specified measure 
of discrepancy, such as Q%($)- This is done under the as- 
sumption that U holds true. Then, when this measure ex- 
ceeds a certain quantity, the model U is rejected. 

One problem of such methods is that a large data set leads 
essentially to rejection of any model U (an illustrative dis- 
cussion can be found in Berger, 1985), because the "real 
world" is never exactly described by such a model and as 
the number of observations increases, statistical methods 
will always detect these deviations between the model and 
"reality". Conversely, the selected statistical criterion may 
lead to a decision in favour of U (albeit wrong) , because it is 
not capable to detect important deviations from U or the de- 
cision is affected by quantities which are not captured in the 
model U (e.g. correlation between the yi). Another problem 
can be over-fitting of data by models with a too large num- 
ber of parameters. Therefore, various methods have been 
suggested which penalise the number of parameters, i.e. the 
complexity of a model (Akaike, 1974, Burnham et al., 1998, 
or Schwarz, 1978). 

In this paper, we suggest a methodology which aims to avoid 
these problems by considering the distribution of a discrep- 
ancy measure such as <2jv(i?) under all "possible" functions 
/. This extends the method given in [BM1] to the more re- 
alistic case where the "true" function / is not restricted to 
be in U. Furthermore, a graphical method will be presented 
which allows to select the most appropriate between several 
competing models Ui. With our method, this is still possible 
if these models have different numbers of parameters. 



In the next section we will describe the method and its al- 
gorithmic implementation, the wild bootstrap. Based on the 
theory presented in sect. 2, we suggest in sect. 3 a graph- 
ical method to assess the validity of U as well as to com- 
pare between different models. This method is denoted as 
p-value curve analysis. In sect. 4 our method is applied to 
a near-infrared [NIR] L-band map of the Milky Way [MW] 
and two different models of the spatial luminosity distribu- 
tion are compared. One of the models includes a flaring disc 
component. We analyse the models' p-value curves, and find 
that flaring in the disc improves the fit to the data. 



2 A NEW METHOD OF MODEL SELECTION 

In section 2.1 we briefly recall the methodology suggested 
in [BM1] and extend it to the situation where / is not in 
the model U. This will be used to compute p-value curves, 
a graphical method of model diagnostics, which was intro- 
duced by Munk & Czado (1998) in a different context. In 
sect. 2.2 we describe the practical application of the method. 

2.1 Basic theory of the method 

We begin with an introduction to the basic principles of our 
method. As mentioned above Q%("&) fails to be a valid crite- 
rion for goodness of fit in inhomogeneous models [BM1] . In- 
stead we replace the pure residuals yi — f$(ti) with smoothed 
residuals, to allow for a valid statistical analysis. For the 
smoothing step we require an injective linear integral oper- 
ator with kernel T, viz. 

5 W = T(/)H= J T(w,v)f(v)dv 

which maps the function / to be recovered onto g. In princi- 
ple any injective operator T is a valid option for the smooth- 
ing, however a good choice is driven by aspects such as effi- 
ciency and simplicity. In our example (cf. sect. 4) we intro- 
duce "cumulative smoothing" with T(w,v) = min(w,u). An 
extensive simulation study by Munk & Ruymgaart (1999) 
revealed this smoothing kernel as a reasonable choice which 
yields a procedure capable to detect a broad range of devi- 
ations from the class of functions U = {f$ : $ G O}. 

A measure of the discrepancy between the "true" / and U 
is the transformed distance 

D 2 (/)=min|!T(/-,^)|| 2 (2) 

where the norm refers to some L 2 -norm. Now assume that 
the minimum in eq. ^ is achieved at a parameter vector $* = 
€ O. Because is unknown it has to be estimated 
from the data. This can be done by numerical minimisation 
of the empirical counterpart of the r.h.s. of eq. || 

D 2 :=mm\\T,U-g\\ 2 
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where 

ft 

g = AT- 1 ^j /i T( U ,i i ) 

i=l 

is an estimation of g using the noisy data j/i. The reasoning 
behind this approach is that for sufficiently large number N 
of observations (in our example iV = 4800) it can be shown 
that g converges in probability to the true (but unknown) 
function g, independently whether a parametric model U is 
valid or not. On the other hand the empirical minimiser #t 
estimates the best possible fit to g by the model U. The 
resulting estimator is denoted as a smoothed minimum dis- 
tance estimator $t (SMDE) and has the property that, if 
the true function / = g&* is in U, i?t — * &* as the sam- 
ple size increases. For detailed proofs we refer to Munk & 
Ruymgaart (1999). 

Note that ■&* is the "true" best-fitting parameter vector, 
which could only be determined if the data would be free of 
noise, whereas i?t is an estimation of the best-fitting param- 
eter vector using the noisy data. Here and in the following, 
quantities with a hat, a "" t are estimated from the noisy 
data, whereas such without a hat are the "true" functions 
to be recovered. 

Munk & Ruymgaart (1999) showed that the probabilistic 
limiting behaviour of D 2 depends on whether / belongs to 
the model U under investigation. More precisely when / 
belongs to U the distribution of ND 2 is for large N approx- 
imately that of 

oo 

£>X? ( 3 ) 

i=l 

where Xi denotes a sequence of independent squares of stan- 
dard normal random variables and Xi > is a sequence of 
real numbers, s.t. J^^i ^? < °°> which depend on the 
distribution C of errors £; and the operator T. 

In contrast if / does not belong to U, we have D 2 > and 
— D 2 ^ tends for large iV to a centred normal distri- 
bution with variance a\. c depending on T, #*, and C. 
Observe, that we obtain two different types of distributions, 
accordingly to the situation whether the "true" (unknown) 
function / is in the model U or not. Because of the com- 
plicated dependency of the (Ai)igjv and o\ £ ^» on T, 
and C a resampling algorithm should be applied in order to 
approximate these limiting distributions. Stute et al. (1998) 
presented a wild bootstrap algorithm which can be used to 
approximate the law ND 2 . Munk (1999) showed that this 
algorithm is also valid when / does not belong to U , which 
is crucial for our paper. This algorithm will be carefully ex- 
plained in the next paragraph. Recall that the subsequent 
bootstrap algorithm allows to determine the probability dis- 
tribution of the quantity of interest D 2 . The general strategy 
of our method will be the following. Because D 2 measures 
the distance between the model U and the estimator g from 
noisy data, knowledge of the probability distribution of D 2 
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Figure 1. Binary probability distribution required in step 2 of 
the wild bootstrap algorithm. The ordinate gives the probabil- 
ity of the random number to be — (y/S + l)/2 and (v5+ l)/2, 
respectively. 

(which will be determined by the subsequent bootstrap al- 
gorithm) allows us to quantify whether an observed value of 
D 2 for a model U is more likely than for a competing model 
U' , say. Even, when none of these models is completely true 
(which is always the case in the real world) D 2 quantifies the 
best possible approximation of g by U or U' respectively. 

2.2 Practical application of the method 

We now introduce the resampling algorithm to approximate 
the law ND 2 . The algorithm starts with the determination 
of the SMDE #t and the smoothed residuals between this 
model and the data (step 1). Then in step 2-5 the resampling 
part of the algorithm follows. The same algorithm is used in 
[BM1]. 

Step 1: (Generate residuals). Compute residuals 
Si ■= Vi - U T (ti), i = l,---,n 
where t?t denotes a solution of the minimisation of 
D 2 ~X 2 ($t) := min||3-TM| 2 . 

Step 2: (The "wild" part). Generate new random vari- 
ables c*, i — 1, . . . , n, which do not depend on the data, 
where each c* is distributed to a distribution which assigns 
probability (y/E + l)/2y/E to the value (-y/E - l)/2 and 
(y/E - 1) /2y/E to the value (y/E+l) /2. See fig. for a visu- 
alisation of this probability distribution. 

Step 3: (Bootstrapping residuals). Compute e* := iiC* and 
Vi = /j T + £ i ■ This gives a new data vector (y* , tj)j=i,...,n. 

Step 4: (Compute the target). Compute D 2 with 

(yt,ti)i=l,...,n 

Step 5: (Bootstrap replication). Repeat step 1-4 B times 
which gives values D\ D% t . B is a large number, typ- 

ically B = 500 or B = 1000 is sufficient. 
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From the bootstrap replications b\ t , . . . , b 2 B t we compute 
the quantities 

xi = Vn(dI* -D 2 ) ,...,x b = VN - D 2 ) , 

using the number of data points N. The xi,...,xb arc 
realisations of the random quantity X = \f~N (t) 2 — Z) 2 ). 
It can be proved that the empirical distribution function 
of £> 2 „, . . . ,£>% „ yields an approximation to the true dis- 
tribution of D 2 after a proper re-centring, i.e. the cu- 
mulative probability distribution function F B of X = 
y/~N (t) 2 — I) 2 ) is close to the cumulative distribution func- 
tion oiyfN (b 2 - D 2 ) for any D 2 > (Munk, 1999). 

An important application of this result is to determine 
an approximation to the probability p(t,D 2 ) that D 2 is 
below a certain value t, provided the distance between 
true function / and the model is D 2 . To this end we use 
that F B , found from the bootstrap replications, approxi- 
mates the (unknown) cumulative probability distribution of 
^/N(b 2 — D 2 ). The latter distribution allows to determine 
p(t, D 2 ). Hence we are in the position to compare the proba- 
bility that the observed value of D 2 is achieved in all " possi- 
ble worlds", i.e. for any possible /. In fact, it turns out that 
this probability does only depend on / via D 2 (f), which al- 
lows a nice geometric interpretation as we will illustrate in 
the following. 

We will use the asymptotic similarity of the two cumulative 
probability distributions in the following section to estimate 
the probability p(t,D 2 ). From this we then define the P- 
value curve a at (II), which can be regarded as a measure of 
evidence for D 2 < IT, given D 2 and F B . Thus these quanti- 
ties allow to constrain D 2 for a parametric model of a given 
set of data. 



3 P- VALUE CURVES 

The main methodology we propose in this paper is the com- 
putation of a p- value curve as a graphical tool for illustrating 
the evidence of a model. To this end we plot the function 
Qjv(n) = F B (\/N (b 2 - n)) for n > 0, i.e. the value of 
ajv(II) is given by the probability that the random quantity 
X = yfN(b 2 ,-b 2 ) is smaller than (b 2 - II) . Note 
that this implies that for II increasing ajv(II) decreases, be- 
cause we then evaluate the cumulative distribution function 
F B (x) for decreasing x, and in particular, if qjv(II) is small, 
at the left tail of F B . 

The interpretation of the function ajv(II) is as follows. As- 
sume the true distance between model U and / (i.e. the dis- 
tance between the minimising /#* and the "true" function 
/) is D 2 = n. If this holds, the probability that^/V \D 2 - D 2 ) 
is smaller than some value t is given as 

P D , =Yl (VN (b 2 - D 2 ) < t) « F* B {t) (4) 

where the r.h.s. denotes the bootstrap approximation to the 
true distribution function on the l.h.s. Now we reject the 



hypotheses H : D 2 > II (vs. alternative K : D 2 < II) when- 
ever ajv(n) < a for a given level of significance a. Hence 
1 — Qjv(n) can be regarded as the estimated evidence in 
favour of the model U (up to a distance between model and 
data D 2 < U). 

Note that this approach highlights the fact that finally the 
astrophysicist has to decide whether a value of D 2 — II 
should be regarded as scientifically negligible or as devia- 
tion from the model U which is considered as too large by 
astrophysical reasons. We mention that the classical good- 
ness of fit tests do not offer the scientist the specification of 
such a value n. 

How can an upper bound for a just acceptable D 2 be de- 
termined? One simple suggestion is to compute the distance 
D 2 = ||T/^ t - T/^ t || 2 between the best model and 
"test models" . Such test models should then be con- 
structed from by adding (systematic) deviations to the 
model, which are still considered as scientifically negligible 
differences to the best model. Then, if D 2 is not larger than 
the average over the test models < u >, computed from a 
number of such test models, it is considered as scientifically 
negligible. 

Observe that with our proposed method the statistical type 
one error is the error to decide for the model (or more precise 
for a neighbourhood D 2 < II of the model) although it is not 
valid. Classical goodness of fit tests are only able to control 
the error of rejecting the model albeit it holds, i.e. they are 
based on testing H : D 2 = vs. K : D 2 > 0. 

Fixing b 2 , a small value of ajv(n) indicates large probability 
for D 2 < n and a large value (close to 1) of ajv(n) indicates 
a large probability for D 2 > n. It is important to note that 
the interesting regions of the resulting curves Qjv(n) are 
those values of n where ajv(n) is rather large (larger than 
0.9 say) and rather small (smaller than 0.1) in accordance 
with the usual choice of levels of significance. In contrast 
decisions based on ajv in regions where ajv(n) w 0.5 would 
correspond to flipping a coin in order to decide whether 
D 2 < n or not. 

As an important advantage of p-value curves we find that 
it gives us not only an estimated probability (p-value) that 
we would observe a test statistic (such as Q 2 N ($) or D 2 ) 
provided the assumption that U underlies the data is true. 
Rather we obtain simultaneously all scenarios over the entire 
range of "possible worlds" which are parametrised by D 2 . In 
particular this implies that models with a large number of 
parameters are penalised in an automatic way. As the num- 
ber of parameters increases the variability of the statistic 
b 2 increases and hence the variability of F B , i.e. the range 
of values for X, for which F B differs significantly from 
and 1, is larger. On the other hand the bias is reduced. As 
the number of parameters decrease the opposite will be the 
case. This leads to a curve a„(n) which slowly decreases to 
zero if the variance is too large or if the bias is too large. 
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Figure 2. Typical cases for p-value curve comparison of two parametric models. The vertical lines at El = 0.08 in the graphs indicate 
the observed value of D 2 = 0.08. In graph 1, model 1 fits better, as well as in graph 2. However in graph 2 is additionally strong evidence 
that model 1 does not hold. Graph 3 is again an example with model 1 the better model. Finally in graph 4 the situation depends on 
the assumption of the distance between the parametric model U and the true regression function / (cf. sect. 2) D 2 . 



Hence evidence for a small II can only be claimed if these 
two quantities are balanced. 

In other words a p-value curve reflects automatically the 
tradeoff between variance and bias in a regression. Here the 
bias of the regression functions can be viewed as the dif- 
ference between the "true" expectation value E[yi] and the 
value of the regression function f(U). The variance provides 
an estimate of the uncertainty of the best-fitting parameters 
7? or i9 T . 

Before we analyse two competing models for the structure of 
the MW we illustrate in an artifical example typical features 
of p-value curves. In fig. ^various scenarios are displayed. In 
graph 1 model 1 beats model 2 at all fronts. The estimated 
evidence for D 2 < II is uniformly larger for any n > 0. This 
coincides with "classical testing" because also the classical 
p-value for testing H:D 2 = is larger. Observe that the 
classical p-value corresponds in this graph to 1 — ajv(0). 

Graph 2 is similar, observe however, that a classical analysis 
would indicate that here is additionally strong evidence that 
model 1 does not hold (ajv(0) > 0.9), although it yields a 
better fit as model 2, exactly as in graph 1. Here the value 



of n where ajv(n) = 0.1 (i.e. where n « 0.7), say, becomes 
important because it gives an idea of the order of magnitude 
between model U and the true regression. Hence it has to 
be decided for the particular problem whether a distance of 
n « 0.7 is considered as "large" or scientifically irrelevant. 

Graph 3 represents a typical case of over-fitting by model 
2. Classical reasoning would prefer model 2 because ajv(0) 
is smaller and hence the classical p-value larger. However, 
we see that this is due to a lack of power of the used test 
statistic, because the slope of the curve is very flat due to 
a large variability of the test statistic. Hence there is not 
much support for the decision D 2 < 0.5, say, (cvjv(0.5) R3 0.3) 
whereas model 1 yields a at (0.5) ~ 0.03. Thus there is strong 
evidence that the distance between model 1 and the true 
regression curve is smaller than 0.5, say. 

Finally in graph 4 both models are acceptable with slight 
preference to model 2 provided a distance of 0.2 (the point 
of intersection of both curves) is considered as an acceptable 
distance between U and /. If a larger distance, n = 0.5, say 
is considered to be tolerable, however model 1 has to be 
preferred. 
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4 FLARING OF THE STELLAR DISC 

Observations have shown that the HI disc of the MW flares 
(see, for example, Merrifield, 1992, or Malhotra, 1995). The 
situation is much less clear for the stellar disc. Alard (2000) 
finds flaring for the disc outwards of the solar orbit, from 
an analysis of 2 micron sky survey (2MASS) data, with a 
vertical scale-height ~ 300 pc in the solar neighbourhood. 
Other evidence comes from Kent et al. (1991), who have 
fitted parametric models to Spacelab2 IR telescope (IRT) 
2.4/im observations of the MW. The vertical scale-height of 
their best model's disc is constant in the inner « 5 kpc with 
h z = 165 pc, but rises outside of this galactocentric radius 
to h z ~ 247 pc in the solar neighbourhood. Thus the results 
of Alard and Kent et al. for the vertical disc scale-height in 
the solar neighbourhood are consistent to within « 20%. 

We apply our proposed method to dust-corrected 
COBE/DIRBE L-band data (Weiland, 1994; Spergel et al., 
1996), and investigate whether there is evidence for flar- 
ing of the disc inside the solar orbit. We remark that this 
L-band observations are expected to trace the density of 
stars (Binney, Gerhard & Spergel, 1997 [BGS]). Note that 
from non-parametric models of this L-band data [BGS] have 
found vertical scale-heights zo w 120 — 150 pc at R = 5 kpc 
from the galactic centre. They remark that this is inconsis- 
tent with the value of 300 pc from star counts at the Galac- 
tic poles (Gilmore & Reid, 1983), but consistent with the 
findings of Kent et al. (1991). We will apply our proposed 
statistical test on the same data to demonstrate its ability 
in this context as an example application. 

The general outline of this sect, is as follows: First we intro- 
duce the observational data (sect. 4.1), and construct func- 
tional forms for two different parametric models of the MW 
luminosity density distribution (sect. 4.2). Then (sect. 4.3) 
we fit these models to the COBE/DIRBE L-band data and 
apply the wild bootstrap algorithm to both models, with 
B = 5000. Finally we analyse the distribution of the dis- 
tances D 2 between the models and the data, both under 
the assumption that the respective parametric model does 
reproduce the data, and that this is not the case (sect. 4.4). 



4.1 Observational data 

The DIRBE experiment on board the COBE satellite, 
launched in 1989, has provided maps of the sky in several in- 



frared wavebands (Weiland et al. (1994)). This data has been 
used to estimate the luminosity distribution of the MW, 
both parametrically (e.g. Freudenreich, 1998, and Dwek et 
al., 1995), and non-parametrically (Binney & Gerhard, 1996, 
[BGS], Bissantz et al., 1997, and Bissantz & Gerhard, 2001). 

In this paper we use a COBE/DIRBE NIR L-band map, 
corrected for dust absorption by Spergel et al. ((1996)). The 
resolution of the equidistant grid of data is n — 120 points in 
-89.25 deg < I < 89.25 deg and m = 40 points in -29.25 deg < 



b < 29.25 deg. We only use the data -60 deg < I < 60 deg, 
— 20deg<6< 10 deg, to downweight those parts of the sky 
where non-informative parts in the data can be observed due 
to extreme noise (cf. [BM1]). 

This dataset is well suited to demonstrate our proposed 
method since it consists of several thousand data points, 
enough to make the method applicable. Simulations have 
shown that the method is already applicable when more than 
50 data points are available provided the error distribution 
behaves well. 



4.2 The parametric models 

We construct two different parametric models, one including 
flaring, according to the approach of Kent et al. (1991), the 
other not. In this section the functional forms of the mod- 
els are presented, first the individual bulge and disc com- 
ponents. We use a Cartesian coordinate system with axes 
x,y,z. Here x is along the major axis, and y along the mi- 
nor axis of the bulge/bar, both in the main plane of the MW. 
We set the position of the sun in this coordinate system to 
a distance from the main plane of the disc zq = 14 pc, the 
distance to the galactic centre Rq — 8 kpc, and the angle 
between the major axis of the bar and the line-of-sight from 
the sun to the galactic centre (j>bar = 20 deg ([BGS]). Let 



components are: 



+ 



and r = x + y ■ Then the model 



"BGS" bar/bulge: The bulge model is selected similar to 
[BGS] . It is a truncated power law bulge: 

2 / 2 

p(x,y,z) = b ■ 



Om??C (1 + a/a c ) q 
"BGS" disc: A double-exponential disc, without flaring 
[BGS]: 

p(x,y,z) = d- (e- ]zWzo /z + ae- ]z]/zi ■ r d e- r/r « 

"Kent" disc: A double exponential disc, similar to the 
"BGS" disc. But now we include flaring, in spirit of the 
flaring disc model of Kent et al. (1991). Inside of a galac- 
tocentric radius Ri — 5 kpc the scale-height h z is constant. 
Outside of Ri it rises linearly to the solar neighbourhood, 
where the scale-height is 247 pc. We also set the radial disc 
scale length to the Kent et al. value of r d = 3.001 kpc outside 
Ri. Inside Ri the radial scale length is a fit variable. 
Thus we define for r < 5 kpc: 

p(x,y,z) = d ■ (e" |z|/zo /z + ae" |z|/zi /zi) 

— T I T fi 

■r d e 

for r > 5 kpc, with a = z + (0.247 kpc - z ) ■ r/[k P c] ~ 5 : 



p(x,y,z) 



d - ( e - |z|/CT /a + ae- |z|/zi / Zl ) 

„ -r/3.001 kpc„5(3.001 _1 

■r d e e 



[kpc]) 



We remark that this definition of the disc ensures p G 
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C° (^R 3 ), in particular that p is continuous at {(x,y,z) : 
r — x 2 + y 2 — 5 kpc} . 

We remark that we assume a priori some of the parameters 
as fixed. These are the disc parameters a = 0.27 and scale- 
height 21 = 42 pc, and the cusp parameters a c = O.f kpc 
and q = 1.8 [BGS]. We refer to Kent et al. (1991), [BM1] 
and [BGS] for a more detailed description of the models and 
their parameters. 

Using these model parts we define in table 1 two models 
which we analyse. 



m m 



-|— i— i— i— i— p 



t — I — I — r - 



-i — i — i — r - 



~i — i — i — i 



10 c 



Model "noflare" 

_ Model "flare" 



2x10 



8 



3x10 



8 



4x10 



8 



4.3 The fitting algorithm 

The algorithm in order to find estimates for the parame- 
ters in the above mentioned models is discussed in detail in 
[BM1] and only briefly summarised here. 

Our task is to fit the de- reddened COBE/DIRBE L- 



band surface brightness map Yij 



*{U,bj) (Spergel et 



al. 1996), which is blurred by some random error at 
position (h,bj). Particularly, an explorative data analysis 
shows that it is necessary to allow for a position dependent 
noise Var[£jj] = ofj (cf. [BM1]). The linear integral oper- 
ator V projects a three-dimensional luminosity distribution 
p(x,y,z) to a surface-brightness distribution b) at the 
sky, viz: 



<j(l,b)=P(j>) (l,b) 



p(r,l,b)dr, 



with p(r,l,b) = p(x(r,l,b),y(r,l,b), z(r,l,b)) where p is de- 
fined in [BM1]. We assume that V is injective in a neigh- 
bourhood of U = {p$ (-)}iJee- This depends on a proper 
selection of the parametric model U. Thus the problem to 



solve is to recover the MW luminosity density p MW from 

n ib . )siJ MW {li)b . )+E .. = 

MW 



the noisy integral equation ui° s (U 1 u J ; 



<p ^MWj fab^+eij. Note that uj mw is the noise-free sur- 
face brightness distribution of the MW. 

Following the method proposed in [BM1], let u}&(l,b) = 
V (p&) (I, b); i? G 0, and consider the transformed model 

U T = TU = {T m (l,b)}# ee , 

with 



(Tw) (u,v) 



u(l,b)T((u,v), (l,b))dldb. 



Here T is a smoothing integral operator with kernel 
T ((«, v) , (I, b)) = min{w, 1} ■ min{w, 6}; (u,v),(l, b) £ R 2 . 
Munk & Ryumgaart (1999) have shown that this smoothing 
kernel is a reasonable choice (cf. sect. 2 and [BM1]). 



According to sect. 2 we estimate the smoothed MW sur- 
observations Yij =u)° B {h,b 



face brightness g MW (u, v) = (Tcj mm/ ) (u, v) from the noisy 

as 



g [u,v) = 



Figure 3. Distribution Fg of X where X =^N(t) 2 - D 2 ) for 
our models of the COBE/DIRBE L-band data. The solid line 
corresponds to model "noflare" , the dashed line to model "flare" . 



Model "noflare" 
Model "flare" 




2x10 



4x10" 



n 



Figure 4. The p- value curves ajv(n) = F* (y/N(D 2 - II)) for 
our two parametric models of the MW luminosity density distri- 
bution. Model "flare" (dashed line) is better than model "noflare" 
(full line) under the assumption that none of the models holds 
true for the data. 



and 



determine 



numerically 



Toj,j||2, where 



the SMDE 1? T = argmin d6e ||p M 
denotes the usual L 2 -norm. Finally the minimising value 



is computed. 



1?X 1 



To this end we use the Marquardt-Levenberg-algorithm 
(Press et al., (1994)) for the minimisation in a two-step pro- 
cess: 

1. Fitting of the disc parameters: In the first step we fit 
the disc parameters and the bulge normalisation b, with the 
other bulge parameters fixed. 

2. Fitting of the bulge/bar parameters: In the second step 
we fix the disc related parameters found in the first step 
(except for the normalisation parameter d) and fit the 
bulge/bar parameters and d. 



i=i j=i 
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Model 


bulgemodel 


discmodel 






"BGS-bulge" 


"BGS-disc" 


"Kent-disc" 


piioflarc 


X 


X 




pflare 


X 




X 



Table 1. Combinations of the bulge/bar and disc model components to models. 



4.4 P- value curve analysis of our MW models 

Finally we analyse the MW models with our new method. 
To this end we first apply the fitting algorithm (sect. 4.3) 
to both model "flare" and "noflare". From this we obtain 
the best-fitting functions jfl aro / noflaro ) an d the correspond- 
ing distances of the models from the data £)2,flar C /noflar C 
Here superscript flare/noflare indicates that we determine 
this functions for model "flare" and (separately) for model 
"noflare". Then we perform the bootstrap analysis (sect. 
2.2) for both model "flare" and "noflare", using the cor- 
responding best fit function and D 2 for the respective 
model, and B = 5000. This yields the empirical cu- 
mulative probability distribution functions j^. narc /noflarc 
of X = v A/v(D 2 ' flare/noflare -n), and thus the functions 
^flare/noflare Yig. presents the resulting empirical distri- 
bution functions p^P are / no t iarE ^ anc j pjg ^ the p-value curves 

flare /noflare /tt\ 

a N (II). 

The last step in the analysis is to perform the graphical 
analysis of the p-value curves shown in Fig. ||. For every 
assumed distance II between model and the "true" function 
/ we find more evidence for model "flare" than for model 
"noflare" (cf. sect. 3). Therefore we are in the situation of 
graph 1 in fig. |^ and conclude that the p-value curve of model 
"flare" yields significantly more evidence for this model than 
that of model "noflare". Hence inclusion of flaring in the 
stellar disc improves the model. We can exclude that this 
conclusion is due to over-fitting, because the entire p-value 
curve performs better. 

However note that the present analysis provides much more 
information than in [BM1], namely that there is more sta- 
tistical evidence for "flare" as for "noflare" and not only less 
evidence against "flare" compared to "noflare". This is be- 
cause the method in [BM1] is based on the assumption that 
the model holds (as essentially all classical goodness of fit 
procedures do) , and therefore "only" helps to decide whether 
the model should be rejected given the observations. In con- 
trast to this, for the new method proposed in this paper we 
assume that "the model does not hold", and estimate the 
probability that the distance between model and the "true" 
function / is smaller than any assumed distance n. Thus 
variation of n allows to find a (statistical) upper bound for 
the distance between model and data, providing evidence for 
the model (within the limits of the chosen distance between 
model and the true function p). 

We conclude that inclusion of flaring in the double- 



exponential disc improves the fit to the COBE/DIRBE L- 
band data. This result is non-ambiguous, in particular since 
the curve of model "flare" is below the curve of model 
"noflare" over the entire scenario of possible distances n 
in fig. % 

Determining n such that ajv(n = D 2 ) « 0.1 yields an es- 
timate for the distance between the best models with and 
without flaring disc component, respectively, and the true 
density distribution of the MW. We find n w 3 x 10 s . This 
value can be considered as scientifically negligible because it 
is approximately equal to the distance D 2 between the the 
best fitting parametric model of [BGS] and a variant thereof 
in which the parameters have been changed in a random way 
by only « 1%. 



5 FINAL REMARKS AND CONCLUSIONS 
5.1 Statistical Methodology 

We have suggested a method which allows to assess the va- 
lidity of a regression model at a controlled error rate. The 
error rate is fixed by deciding how large D 2 may at the most 
be while still being considerable as scientifically negligible. 
Furthermore, several models can be compared. This compar- 
ison is still possible if the models are parametrised by differ- 
ent numbers of parameters since our method is sensitive to 
over-fitting of the data. It is worthwhile to comment briefly 
on possible relationships to other approaches. As pointed 
out by a referee, our approach is based on weighted least 
squares and hence, in a model with normal heteroscedas- 
tic errors, this is the maximum likelihood estimator (MLE). 
Note, however, that our approach does not require the as- 
sumption of a normal error, in general. 

Other approaches in the literature are based on Bayesian 
ideas, e.g. model averaging where the aim is to maximise 
the aposteriori probability of a model Uk, say, given the 
observations Y , i.e. 

l 

P(T\Y) =J2 P (T\U k ,Y)P(U k \Y) (5) 
fc=i 

where I models are to be compared and P(Uk\Y) denotes 
the posterior probability of the model Uk given Y. Here T= 
"pick the correct model" (see Hoeting et al., 1999, DiCic- 
cio et al., 1997). This approach is conceptually similar to 
ours, because it is based on the idea that the decision in 
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favor of or against a model should be investigated under the 
full scenario of possible models. Bayesian model averaging 
aims for this by averaging, whereas we compare all P-value 
curves among each other. However, in addition, we are in 
the position to decide whether the most appropriate model 
by such a rule should be chosen at all. Interestingly Hoeting 
(p. 399) points out that such an investigation for Bayesian 
model averaging would be of great interest. Another diffi- 
culty in Bayesian model selection consists in the determi- 
nation of priors. Observe, that our approach is based on a 
limit theorem, which holds for any error distribution of e, 
provided Var[e] < oo. It would be important to investigate 
more closely these relationships, however this is beyond the 
scope of this paper. 

5.2 Flaring of the MW disc 

As an example application of our method we have compared 
a parametric model of the MW luminosity distribution with 
a flaring vertical disc scale-height zo with a model without 
flaring in the disc. We find that the model with a flaring disc 
fits better the COBE/DIRBE L-band data than the model 
with constant Zq. We conclude that the stellar disc flares 
outside some inner radius Ri, which is significantly smaller 
than the radius of the solar orbit Rq. 

Can young supergiant stars unrelated to the bulk of the stel- 
lar population, or polycyclic aromatic hydrocarbon 3.3fim or 
dust emission be responsible for our result? Probably not, 
because Alard (2000) finds flaring of the disc outwards of 
the solar orbit from star count data. It seems improbable 
that near the solar circle the cause of the probably same 
phenomenon changes. Also it is believed that the NIR lumi- 
nosity probes the density of stars [BGS]. 
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